Abstract: We describe the participation of the University of Amsterdam’s
ILPS group in the relevance feedback track at TREC 2008. We introduce a new
model which incorporates information from relevant and non-relevant
documents to improve the estimation of query models. Our main findings are
twofold: (i) in terms of statMAP, a larger number of judged non-relevant
documents improves retrieval effectiveness and (ii) on the TREC Terabyte
topics, we can effectively replace the estimates on the judged non-relevant
documents with estimations on the document collection.

1  Introduction 

In our participation in the relevance feedback track this year, our goal
was to explicitly incorporate non-relevance information in the estimation
of query models. Working with the language modeling approach to information
retrieval, we base our model of non-relevant information on the Normalized
Log Likelihood Ratio.

We discuss related work in Section 2, describe our retrieval approach in
Section 3, and detail our model for capturing non-relevance in Section 4.
In Section 5 we describe our runs; we then present our results in Section 6
and conclude in Section 7.

2  Background 

Our chief aim for participating in this year’s TREC Relevance Feedback
track is to extend previous approaches, such as the one proposed by
Lavrenko and Croft (2001), by explicitly incorporating non-relevance
information. Such negative evidence is usually assumed to be implicit,
i.e., in the case of estimating a model from some (pseudo-)relevant data,
the absence of terms indicates their non-relevance status. This means, in a
language modeling setting and for the sets of relevant documents $R$ and
non-relevant documents $\bar{R}$, that
$P(t|\theta_{\bar{R}}) = 1 - P(t|\theta_R)$. The TREC Relevance Feedback
track gives us the opportunity to develop and evaluate models which
explicitly capture non-relevance information, and we participated to answer
the following research questions. Can non-relevance information be
effectively modeled to improve the estimation of a query model? Given our
model, what is the effect of the relative size of the set of non-relevant
documents with respect to the relevant documents on retrieval
effectiveness? And, finally, we ask whether and when explicit non-relevance
information helps. In other words, what are the effects when we substitute
the estimates on the non-relevant documents with more general estimates,
such as from the collection? Some previous work has already experimented
with using negative weights for non-relevance information, either in an
ad-hoc or more principled fashion, with mixed results (Dunlop, 1997; Ide,
1971; Wang et al., 2008; Wong et al., 2008).

The model we propose leverages the distance between each relevant document
and the set of non-relevant documents, by penalizing terms that occur
frequently in the latter, similar to the intuitions described by Wang et
al. (2008). Instead of subtracting probabilities, however, we take a more
principled approach based on the Normalized Log Likelihood Ratio (NLLR).
Moreover, similar to other pseudo-relevance feedback approaches, such as
the one proposed by Lavrenko and Croft (2001), we reward terms that appear
frequently in the individual relevant documents. Although the NLLR is not a
true distance between distributions (since it does not satisfy the triangle
inequality), we consider it to be a useful candidate for measuring the
(dis)similarity between two probability distributions.

3  Retrieval  Framework 

We employ a language modeling approach to IR and rank documents by their
log-likelihood of being relevant given a query. Without presenting details
here, we only provide our final formula for ranking documents, and refer
the reader to (Balog et al., 2008) for the steps of deriving this equation:

$$\log P(D|Q) \propto \log P(D) + \sum_{t \in Q} P(t|\theta_Q) \cdot \log P(t|\theta_D). \qquad (1)$$

Here, both documents and queries are represented as multinomial
distributions over terms in the vocabulary, and are referred to as document
model ($\theta_D$) and query model ($\theta_Q$), respectively. The third
component of our ranking model is the document prior ($P(D)$), which is
assumed to be uniform. Note that by using uniform priors, Eq. 1 gives the
same ranking as scoring documents by measuring the KL-divergence between
the query model $\theta_Q$ and each document model $\theta_D$, in which the
divergence is negated for ranking purposes (Lafferty and Zhai, 2001).
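To see why the two rankings coincide, the negated KL-divergence can be expanded using only the definitions above. This is a short derivation added for illustration, with $H(\theta_Q)$ denoting the entropy of the query model:

```latex
\begin{aligned}
-KL(\theta_Q \,\|\, \theta_D)
  &= -\sum_t P(t|\theta_Q) \log \frac{P(t|\theta_Q)}{P(t|\theta_D)} \\
  &= \sum_t P(t|\theta_Q) \log P(t|\theta_D) + H(\theta_Q).
\end{aligned}
```

The entropy term does not depend on the document, and $\log P(D)$ is constant under a uniform prior, so both scores order documents identically. Terms with $P(t|\theta_Q) = 0$ contribute nothing to the sum.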

Unless  indicated  otherwise,  we  estimate  each  document 
model  by: 

$$P(t|\theta_D) = (1 - \lambda_D) \cdot P(t|D) + \lambda_D \cdot P(t), \qquad (2)$$

where $\lambda_D$ is a parameter that we use to tune the amount of
smoothing. $P(t|D)$ indicates the maximum likelihood estimate (MLE) of term
$t$ on a document, i.e., $P(t|D) = n(t,D) / \sum_{t'} n(t',D)$, and $P(t)$
the MLE on the collection $C$:

$$P(t) = P(t|C) = \frac{\sum_{D \in C} n(t,D)}{\sum_{D \in C} \sum_{t'} n(t',D)}. \qquad (3)$$
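As an illustration of Eqs. 2 and 3, the following minimal sketch builds a smoothed document model from token counts. The function names and the toy documents are hypothetical and not taken from our implementation; the default value of `lambda_d` is only a placeholder.

```python
from collections import Counter

def mle(counts):
    """Maximum likelihood estimate: relative frequency of each term."""
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def smoothed_doc_model(doc_tokens, collection_model, lambda_d=0.2):
    """Eq. 2: P(t|theta_D) = (1 - lambda_D) * P(t|D) + lambda_D * P(t)."""
    p_doc = mle(Counter(doc_tokens))
    return {
        t: (1.0 - lambda_d) * p_doc.get(t, 0.0) + lambda_d * p_coll
        for t, p_coll in collection_model.items()
    }

# Toy collection of two tokenized documents (hypothetical data).
docs = [["tree", "frog", "tree"], ["frog", "pond"]]

# Eq. 3: MLE over all term occurrences in the collection.
collection_model = mle(Counter(t for d in docs for t in d))

theta_d = smoothed_doc_model(docs[0], collection_model)
```

Because the collection model assigns non-zero probability to every term in the vocabulary, the smoothed model does too, which keeps the logarithm in Eq. 1 finite.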

As to the query model $\theta_Q$, we adopt the common approach of linearly
interpolating the initial query with an expanded part (Balog et al., 2008;
Kurland et al., 2005; Rocchio, 1971; Zhai and Lafferty, 2001):

$$P(t|\theta_Q) = \lambda_Q \cdot P(t|\hat{\theta}_Q) + (1 - \lambda_Q) \cdot P(t|Q), \qquad (4)$$

where $P(t|Q)$ indicates the MLE on the initial query, $P(t|\hat{\theta}_Q)$
is the expanded part, and the parameter $\lambda_Q$ controls the amount of
interpolation. The main goal of our participation is to find ways of
improving the query model $\theta_Q$ using (non-)relevance information.
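The interpolation in Eq. 4 and the query-dependent part of the ranking formula in Eq. 1 can be sketched as follows. All names and the toy distributions are hypothetical; the sketch assumes that each document model is already smoothed, so that every query term has non-zero probability.

```python
import math

def interpolate_query_model(initial, expanded, lambda_q=0.4):
    """Eq. 4: lambda_Q * P(t|expanded part) + (1 - lambda_Q) * P(t|Q)."""
    terms = set(initial) | set(expanded)
    return {
        t: lambda_q * expanded.get(t, 0.0) + (1.0 - lambda_q) * initial.get(t, 0.0)
        for t in terms
    }

def score(query_model, doc_model):
    """Query-dependent part of Eq. 1: sum_t P(t|theta_Q) * log P(t|theta_D)."""
    return sum(p * math.log(doc_model[t]) for t, p in query_model.items() if p > 0.0)

# Hypothetical initial query (MLE) and expanded part.
initial = {"frog": 0.5, "pond": 0.5}
expanded = {"frog": 0.5, "tadpole": 0.3, "pond": 0.2}
theta_q = interpolate_query_model(initial, expanded)

# Two hypothetical smoothed document models over the same vocabulary.
doc_a = {"frog": 0.4, "pond": 0.3, "tadpole": 0.2, "tree": 0.1}
doc_b = {"frog": 0.1, "pond": 0.1, "tadpole": 0.1, "tree": 0.7}
ranking = sorted(["doc_a", "doc_b"],
                 key=lambda name: score(theta_q, {"doc_a": doc_a, "doc_b": doc_b}[name]),
                 reverse=True)
```

With a uniform document prior, the term $\log P(D)$ is the same for every document and can be left out of the score.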


4  Modeling  Non-Relevance 

Kraaij (2004) defines the NLLR measure as being equivalent to determining
the negative KL-divergence for document retrieval. It is formulated as:

$$\mathrm{NLLR}(Q|D) = H(\theta_Q, \theta_C) - H(\theta_Q, \theta_D), \qquad (5)$$

where $H(\theta, \theta')$ is the cross-entropy between two multinomial
language models:


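$$\begin{aligned}
H(\theta, \theta') &= H(\theta) + KL(\theta \,\|\, \theta') \\
 &= -\sum_t P(t|\theta) \log P(t|\theta) + \sum_t P(t|\theta) \log \frac{P(t|\theta)}{P(t|\theta')} \\
 &= -\sum_t P(t|\theta) \log P(t|\theta').
\end{aligned}$$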


Eq. 5 can be interpreted as the relationship between two language models
$\theta_Q$ and $\theta_D$, normalized by a third language model $\theta_C$
(these three models are estimated using Eq. 4, Eq. 2, and Eq. 3,
respectively). The NLLR is a measure of average surprise: the better a
document model ‘fits’ a query distribution, the higher the score will be,
and $H(\theta_Q, \theta_D)$ will be smaller than $H(\theta_Q, \theta_C)$ for
relevant documents. In other words, the smaller the cross-entropy between
the query and document model (i.e., when the document language model better
fits the observations from the query language model), the higher the
document will be ranked.
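A minimal sketch of Eq. 5 on toy distributions may make the sign of the score concrete. The function names and the numbers are hypothetical, and the models are assumed to be smoothed so that no logarithm of zero occurs.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_t p(t) * log q(t), over terms with p(t) > 0."""
    return -sum(pt * math.log(q[t]) for t, pt in p.items() if pt > 0.0)

def nllr(query_model, doc_model, collection_model):
    """Eq. 5: H(theta_Q, theta_C) - H(theta_Q, theta_D)."""
    return (cross_entropy(query_model, collection_model)
            - cross_entropy(query_model, doc_model))

# Hypothetical query, collection, and document models.
theta_q = {"frog": 0.7, "pond": 0.3}
theta_c = {"frog": 0.2, "pond": 0.2, "tree": 0.6}
on_topic = {"frog": 0.5, "pond": 0.3, "tree": 0.2}
off_topic = {"frog": 0.1, "pond": 0.1, "tree": 0.8}

score_on = nllr(theta_q, on_topic, theta_c)
score_off = nllr(theta_q, off_topic, theta_c)
```

In this example the on-topic document explains the query terms better than the collection does and receives a positive score, while the off-topic document explains them worse and receives a negative one.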

Based on the NLLR measure, we have developed the following model by which
we estimate $P(t|\hat{\theta}_Q)$ in Eq. 4. The intuition is to determine,
for each term, the probability that it was sampled from each relevant
document as well as the probability that it was sampled from the set of
non-relevant documents:

$$P(t|\hat{\theta}_Q) \propto \sum_{D \in R} P(t|\theta_D) \, P(\theta_D|\theta_R).$$


We weigh each term by the distance between $R$ and $\bar{R}$ and its
importance in the current document by setting:

$$P(\theta_D|\theta_R) = \frac{\mathrm{NLLR}(D|R)}{\sum_{D'} \mathrm{NLLR}(D'|R)}, \qquad (6)$$


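where

$$\begin{aligned}
\mathrm{NLLR}(D|R) &= H(\theta_D, \theta_{\bar{R}}) - H(\theta_D, \theta_R) \\
 &= \sum_t P(t|\theta_D) \log \frac{P(t|\theta_R)}{P(t|\theta_{\bar{R}})} \qquad (7) \\
 &= \sum_t P(t|\theta_D) \log \frac{(1 - \delta_1) P(t|R) + \delta_1 P(t)}{(1 - \delta_2) P(t|\bar{R}) + \delta_2 P(t)}.
\end{aligned}$$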


The $\delta$ parameters provide us with the means to control the individual
influence of each set of relevant and non-relevant documents versus a
background model. $P(t|R)$ and $P(t|\bar{R})$ are estimated by considering
the MLE on the documents in the respective set.
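The estimation in Eqs. 6 and 7 can be sketched as follows. This is an illustration under stated assumptions rather than our implementation: all function names and the toy documents are hypothetical, the set-level MLE is assumed to pool term counts over the documents in a set, and the NLLR weights are assumed to be positive so that the normalization in Eq. 6 is well defined.

```python
import math
from collections import Counter

def mle(token_lists):
    """Pooled MLE over a set of tokenized documents (an assumed estimator)."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def mix(model, background, delta):
    """(1 - delta) * P(t|model) + delta * P(t), over the background vocabulary."""
    return {t: (1.0 - delta) * model.get(t, 0.0) + delta * p
            for t, p in background.items()}

def nllr_doc(theta_d, theta_rel, theta_nonrel):
    """Eq. 7: sum_t P(t|theta_D) * log(P(t|theta_R) / P(t|theta_nonR))."""
    return sum(p * math.log(theta_rel[t] / theta_nonrel[t])
               for t, p in theta_d.items() if p > 0.0)

def expanded_query_model(rel_docs, nonrel_docs, background,
                         delta1=0.2, delta2=0.6, lambda_d=0.2):
    theta_rel = mix(mle(rel_docs), background, delta1)
    theta_nonrel = mix(mle(nonrel_docs), background, delta2)
    doc_models = [mix(mle([d]), background, lambda_d) for d in rel_docs]
    weights = [nllr_doc(td, theta_rel, theta_nonrel) for td in doc_models]
    z = sum(weights)  # Eq. 6 normalizer; assumes positive weights
    model = Counter()
    for td, w in zip(doc_models, weights):
        for t, p in td.items():
            model[t] += p * w / z
    total = sum(model.values())
    return {t: p / total for t, p in model.items()}

# Hypothetical judged documents.
rel_docs = [["frog", "pond", "frog"], ["frog", "tadpole"]]
nonrel_docs = [["tree", "bark", "tree"]]
background = mle(rel_docs + nonrel_docs)

q_hat = expanded_query_model(rel_docs, nonrel_docs, background)
# delta2 = 1.0 replaces the non-relevant model with the background model.
q_hat_collection = expanded_query_model(rel_docs, nonrel_docs, background, delta2=1.0)
```

Terms that are frequent in the relevant documents and rare in the non-relevant ones receive the largest share of probability mass in the resulting model.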


5  Runs 

We submitted 2 runs, each consisting of 5 separate runs (one for each set
of provided relevance judgements). The capital letters in each run indicate
the relevance judgements per topic used for that run: (A) no relevance
judgements, (B) 3 relevant documents, (C) 3 relevant and 3 non-relevant
documents, (D) 10 judged documents (division unknown), (E) large set of
judgements (division and number unknown).

Our submissions follow this intuition: given that we know which documents
are relevant and which are not relevant to the query, can we use this
information to obtain a better estimate of our query model? We hypothesize
that our model gains the most when the set of non-relevant documents is
large enough to give a proper estimate of non-relevance. We expect the
background collection to be a better estimate of non-relevance when the set
of judged non-relevant documents is small, but expect to obtain an
increasingly good estimate using the non-relevant documents as the size of
this set increases. Thus, we compare our model using explicit non-relevance
information to the same model using the collection as a non-relevance
model, by submitting two distinct runs: met6, using the set of non-relevant
documents, and met9, using only the collection ($\delta_2 = 1$, cf. Eq. 7).

Run     Set   MAP       P5        P10
both    A     0.1364    0.2516    0.2452
met6    B     0.1732*   0.2645    0.2677
met6    C     0.1568    0.3484    0.3129
met6    D     0.1584    0.3097    0.3129
met6    E     0.1689    0.2645    0.2677
met9    B     0.1769*   0.3161    0.3194
met9    C     0.1699*   0.3161    0.3032
met9    D     0.1738*   0.4000*   0.3710*
met9    E     0.1959*   0.2903    0.2871

Table 1: Evaluation on the 31 TREC Terabyte topics (top 10); significance
tested against the baseline (set A). An asterisk marks a significant
difference.

Preprocessing and Parameter settings  We did not perform any preprocessing
of the data besides standard stopword removal and stemming using a Porter
stemmer. For our models we need to estimate four parameters: $\delta_1$,
$\delta_2$, $\lambda_D$, and $\lambda_Q$. We have used the odd-numbered
topics from the TREC Terabyte track (topics 701-850) and from the TREC
Million Query track (topics 1-10000) as training data. We have performed
sweeps (with steps of 0.1) over possible values for these parameters and
selected the parameter settings with the highest resulting MAP scores. The
resulting set of parameters that we have used for met6 is:
$\lambda_D = 0.2$, $\lambda_Q = 0.4$, $\delta_1 = 0.2$, and
$\delta_2 = 0.6$. The settings for met9 are: $\lambda_D = 0.2$,
$\lambda_Q = 0.4$, and $\delta_1 = 0.2$.
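Such a sweep could be organized as in the sketch below. It assumes a joint grid over the full range [0, 1] for all four parameters, which is only one possible reading of the procedure described above. The `evaluate_map` callback is hypothetical and stands in for a function that runs retrieval on the training topics and returns MAP; the toy objective is an artificial stand-in, not real retrieval output.

```python
import itertools

def sweep(evaluate_map, names=("delta1", "delta2", "lambda_d", "lambda_q"), step=0.1):
    """Exhaustive grid search; returns (best_settings, best_map)."""
    n = round(1 / step)
    grid = [round(i * step, 10) for i in range(n + 1)]
    best_settings, best_map = None, float("-inf")
    for values in itertools.product(grid, repeat=len(names)):
        settings = dict(zip(names, values))
        map_score = evaluate_map(**settings)
        if map_score > best_map:
            best_settings, best_map = settings, map_score
    return best_settings, best_map

# Artificial objective that peaks at the met6 settings listed above.
def toy_map(delta1, delta2, lambda_d, lambda_q):
    return -((delta1 - 0.2) ** 2 + (delta2 - 0.6) ** 2
             + (lambda_d - 0.2) ** 2 + (lambda_q - 0.4) ** 2)

best, best_score = sweep(toy_map)
```

With eleven grid points per parameter, the joint grid contains 11^4 = 14,641 settings, each of which would require a full retrieval run on the training topics.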

6  Results  and  Discussion 

The results of our 10 individual runs are listed in Table 1 and Table 2.
Figures 1 and 2 further illustrate the differences on the 31 TREC Terabyte
topics. We use the Wilcoxon signed-rank test to test for significant
differences between runs and report on significant increases (or drops) for
p < .01 and for p < .05. Note that the baseline runs (set A) are the same
for both methods, since neither uses any kind of relevance information. The
same holds for set B: in this set only relevant information is available
and the two methods should therefore result in the same scores. Due to a
small bug in the implementation, however, parameter $\delta_2$ was not
properly normalized, causing a slight difference in the retrieval results
for met6 on set B.


        setA     setB      setC      setD      setE
met6    0.2289   0.2595*   0.2750*   0.2758*   [illegible]
met9    0.2289   0.2608*   0.2787*   0.2777*   [illegible]

Table 2: Evaluation with statMAP; significance tested against baseline
(set A). An asterisk marks a significant difference.

As  stated  earlier,  we  submitted  our  runs  to  explore  three 
main  research  questions: 

•  Can non-relevance information be effectively modeled to improve the
estimation of a query model?

•  What is the effect of the relative size of the set of non-relevant
documents with respect to the relevant documents on retrieval
effectiveness?

•  What are the effects when we substitute the estimates on the
non-relevant documents with more general estimates, such as from the
collection?

The results reported in Table 1 and Figure 2 with respect to met6 give an
answer to the first question. In all conditions, i.e., on all three
measures as well as for the different relevance feedback sets, retrieval
performance improves over the baseline, which confirms that our model can
effectively incorporate non-relevance information for query modeling. Given
a limited number of non-relevant documents (sets C and D), our model
especially improves early precision, although not significantly. A larger
number of non-relevant documents (set E) decreases overall retrieval
effectiveness. From Figure 2a we observe that set E only outperforms the
other sets at the very ends of the graph. Figure 1 shows a per-topic
breakdown of the difference in MAP between the two submitted runs. We
observe that most topics are helped more using the collection-based
estimates. We have to conclude that, for the TREC Terabyte topics, the
estimation on the collection yields the highest retrieval performance and
is thus a better estimate of non-relevance than the judged non-relevant
documents.

When we zoom out and look at the full range of available topics (Table 2),
we observe that both models improve statMAP over the baseline (set A) for
the full set of topics. When the feedback set is small, met9 improves
statMAP more effectively than met6, i.e., the background model performs
better than the non-relevant documents. On the largest set of feedback
documents (set E), met6 obtains the highest statMAP score (although the
difference with met9 is not significant for this set, tested using a
Wilcoxon signed-rank test). The difference does seem to suggest that the
amount of non-relevance information needs to reach a certain size to
outperform the estimation on the collection. Since we select the terms that
are most likely to be sampled from the distribution of the relevant
documents rather than the non-relevant documents, it is crucial that the
underlying relevant and non-relevant distributions can be accurately
estimated. While the relevant documents are topically concentrated, i.e.,
they are all related to a given query, the non-relevant documents can be
topically diverse and therefore more difficult to estimate when the number
of examples is limited. The background information is generally a good
approximation of the distribution of non-relevant documents, given that
most of the documents in the collection are not relevant. On the other
hand, as the size of the set of non-relevant examples increases, especially
the query-specific top-ranked non-relevant documents, we can more
accurately estimate the true distribution of the non-relevant information,
which enables our model to have more discriminative power. Where this
cut-off point lies remains a topic for future work.

Figure 1: Per-topic difference in MAP between met6 and met9 on the 31 TREC
Terabyte topics and the various sets of relevance feedback information (a
positive value indicates that met9 outperforms met6 and vice versa). The
labels indicate the respective topic identifiers.

Figure 2: Precision-recall plots of met6 (a) and met9 (b) on the various
feedback sets and the 31 TREC Terabyte topics (top 10).


7  Conclusion 

The results presented here provide us with mixed evidence regarding the
hypothesis we stated in Section 5. Some of the presented results (statMAP
and Figure 2a) confirm the premise that, using met6, a larger number of
judged non-relevant documents improves retrieval effectiveness most. On the
other hand, the overall results obtained on the 31 TREC Terabyte topics
suggest that the collection is a viable and sufficient alternative.

We would like to further explore the problem in two directions. First, we
intend to investigate the impact of the available judged (non-)relevant
documents and their properties with respect to the estimates on the
collection. Second, given the relevance assessments, we will try to find
better ways of estimating the true distribution of the (non-)relevant
information within our framework. We believe that, instead of using maximum
likelihood estimates, more sophisticated estimation methods may be explored
and applied.